-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Definitions.merge
and don't immediately validate definitions
#21746
Conversation
Definitions.merge
and don't immediately validate definitionsDefinitions.merge
and don't immediately validate definitions
@@ -326,7 +328,7 @@ class BindResourcesToJobs(list): | |||
""" | |||
|
|||
|
|||
class Definitions: | |||
class Definitions(DagsterModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I punted on thinking hard about whether this should be a DagsterModel
or a NamedTuple
. Don't have a strong opinion here.
self.get_inner_repository() | ||
|
||
@staticmethod | ||
def merge(*def_sets: "Definitions") -> "Definitions": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I punted on thinking hard about whether this should be *
args or a list. I don't have a strong opinion here.
|
||
def get_asset_graph(self) -> AssetGraph: | ||
"""Get the AssetGraph for this set of definitions.""" | ||
return self.get_repository_def().asset_graph | ||
|
||
def validate_loadable(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially this should return self
so that users can do:
defs = Definitions(...).validate_loadable()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential thing we can do here is make a trivial subclass of Definitions
which is ValidatedDefinitions
, and make this function def validated(self) -> ValidatedDefinitions:
.
That way, we can use the type system to enforce that we're working with a guaranteed-to-be-valid definitions object at certain points in time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NewType
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much like this idea. But couldn't find places in the codebase where we would use it yet. So planning to punt on this in this PR. Let me know if you think we should do otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is long past due and I am very excited that you are pressing on this 👍🏻
I think the thing you need to press on here is the resource system. It is going to be the hardest part of this in my estimation. Namely what is the behavior if you merge Definitions
objects that have overlapping resource keys. If the user want to merge them, and they do not have control over the definitions they are importing, what is the resolution here? Do we somehow scope them?
Namely imagine two assets that have the same resource key, but expect different types.
from dagster import asset
# imagine importing the "clients" from third party libraries you cannot change
class ClientFromLibOne:
def execute(...) -> None
class ClientFromLibTwo:
def invoke(...) -> None:
@asset
def asset_one(client: ResourceParam[ClientFromLibOne]):
client.execute()
@asset
def asset_two(client: ResourceParam[ClientFromLibTwo]):
client.invoke()
Definitions(asset_one, resources={"client": ClientFromLibOne()}).merge(Definitions(asset_two, resources={"client_two": ClientFromLibTwo()})
Imagine those assets came from factories that the user doesn't have control either. An internal module owned by another team or something. What is the behavior? How can they get this to work.
@schrockn I see a couple options here:
After writing them out, I’m going to make the argument that we should expect users to deal with avoiding collisions on their own, without framework help, for the time being. OptionsResource scopingPossible directions:
Risks:
Resource mappingObjects that rely on resources can have both an "inner" resource key and an "outer" resource key:
These objects each have a Risks:
Expecting users to deal with collisions on their ownBecause of their risks, I don’t think we should do either of the above options right now. Rather, I believe that we should expect libraries that expose definition factories to allow users to determine the resource keys that those definitions depend on. Our dbt and dlt integrations already do this. This direction is a two way door that gives us the option to introduce resource remapping or off-by-default scoping in the future.
When we initially allowed and introduced |
1edb52b
to
872416e
Compare
Awesome thanks for writing this up @sryza. Given the options presented definitely agree with your assessment. I'd like to throw out another possibility: Immediately binding resources to whatever definitions are passed into Definitions, and then letting the internal framework just handle collisions. defs_1 = Definitions([[some_def], resources={"a": AInstance(some_value=1)})
defs_2 = Definitions([[another_def], resources={"a": AInstance(some_value=2)})
def = defs_1.merge(defs_2) At runtime, I think this would be intuitive for users, but impose some framework complexity as we would need to bookkeep the scoping. We would also need to rethink the Resource UI, which I think is since it isn't that great or central right now. |
@schrockn fwiw I'd slot that into the "resource scoping" family of solutions I discussed above – it scopes resources to the The main challenges I see here:
More of an internal issue than a user-facing issue, but I think this could also be very heavy lift, because assets from different |
100%. This would require frontend/product/tooling work
I actually don't think this would be that bag. I think we would frame it was immediately binding resources to any definition, and subsequent merges can override.
Yup 100% |
Yes exactly. This what I mean by "impose some framework complexity" |
Tl;dr I think we should move forward with this proposal with the understanding that we might need to add support for resources merging, which I believe we can layer on additively. Let me know when PR is ready for review and will take a look! |
700b1fa
to
befbb17
Compare
Definitions.merge
and don't immediately validate definitionsDeinitions.merge
and don't immediately validate definitions
|
||
def get_asset_graph(self) -> AssetGraph: | ||
"""Get the AssetGraph for this set of definitions.""" | ||
return self.get_repository_def().asset_graph | ||
|
||
@public |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this PR proposes making this method public and non-experimental. This is to avoid a regression in users' ability to validate their definitions.
So somewhat high stakes.
@schrockn @OwenKephart – taking this out of RFC state and should be ready for another round of review. |
branch-name: definitions-merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this is a big mental model change, I think it is important to work through how we would explain this in docs, but now and in a future world of blueprints/blocks etc.
Before we commit this, I think we should write prose that accompanies an example where a user constructs more than one Definitions
objects, merges them, and then finally they get "bound" (or whatever) to produce the usable definition objects that live in repositories.
Right now I don't think we could write that prose with the code as it exists within this PR.
def validate_loadable(self) -> "Definitions": | ||
"""Validates that the enclosed definitions will be loadable by Dagster: | ||
- No assets have conflicting keys. | ||
- No jobs, sensors, or schedules have conflicting names. | ||
- All asset jobs can be resolved. | ||
- All resource requirements are satisfied. | ||
|
||
Raises an error if any of the above are not true. | ||
|
||
Returns: | ||
Definitions: The definitions object, unmodified. | ||
""" | ||
self.get_inner_repository() | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not a fan of this pattern where a user is just magically expected to call this and it is unencoded in the type system. I'd prefer a top level utility function instead of a method. The existence of this sort of validate method implies that a user is supposed to call before using it or something, but that it is totally unenforced here.
A top-level staticmethod
would be preferable in my view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A top-level staticmethod would be preferable in my view.
Sounds good - I'll give that a whirl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw I'm imagining the main use case for this method is in unit tests. I.e. I'm not expecting that everyone would call it in their definitions.py.
Deinitions.merge
and don't immediately validate definitionsDefinitions.merge
and don't immediately validate definitions
@schrockn here's a take at this: https://www.notion.so/dagster/Definitions-merge-prose-e014320e59004417a07e11cf60c8ffea?showMoveTo=true&saveParent=true |
And here's a start at using this in DOP: https://github.com/dagster-io/internal/pull/9809/files. It's only half-done, but it makes the |
100% that diff looks awesome. |
I commented in the Notion but will do it here for posterity etc. Two thoughts.
|
And it's a little new terminology but this is an advanced, opt-in use case so I think that is ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to your queue. Pretty firmly convinced myself that we shouldn't change existing behavior and instead add a new static ctor that defers resource binding.
How are you envisioning that we'd implement this? A I have some resistance to this approach because I think it violates user expectations about the relationship between static constructors and the underlying constructor. I.e. I think the expectation is that static constructors will call the real constructor internally. And any non-optional validation within the real constructor is expressing fundamental constraints that instances of the class are expected to abide by. Stepping back a bit, my understanding of the main reason to frontload this kind of validation is to be able to catch errors earlier. In the standard case where someone is pointing a code location at a Definitions object and starting the web server or the CLI or whatever, I don't think validating at object construction time helps errors get caught earlier in a meaningful way: the validation happens anyway immediately after object construction, when Dagster tries to load it. I think the main place where deferring validation makes a difference is in unit tests. For that case, there will likely be some users who experience negative effects from this change until they update their tests to explicitly call the validation method. However, whatever choices we make here will be our APIs forever, so I think that some temporary user pain is worth it for an saner API. |
Just spitballing here, I would do some sort of yucky hack, but a super localized one, if we don't want to introduce another class. class Definitions:
@staticmethod
# would actually spell out args for type checking
def bundle(self, **kwargs):
with disable_eager_construction():
return Definitions(**kwargs)
# `__init__` identical signature as before
# can disable eager construction with disable_eager_construction
def __init__(self, ...): ... I think there are tons of reasons to want to do validate immediately in test cases, scripting, and other contexts outside of the webserver and other tools. One of the nice things about having |
From an internal perspective, I don't have any problem with maintaining the kind of hack you're suggesting. However, I think that the presence of / need for that hack is indicative of the way that the API would be subverting user expectations. IMO the mental model we expose to users will be a lot simpler if we can say "A |
Yeah understood I'm just concerned about shipping such a big behavior change. |
If we're not comfortable shipping that behavior change, brainstorming directions we could go in:
# submodule1.py
from .resources import snowflake_resource
submodule1_defs = Definitions(assets=[asset1], resources={"snowflake": SnowflakeResource()})
# submodule2.py
from .resources import snowflake_resource
submodule2_defs = Definitions(assets=[asset2], resources={"snowflake": SnowflakeResource()})
# definitions.py
from .submodule1 import submodule1_defs
from .submodule2 import submodule2_defs
defs = Definitions.merge(submodule1_defs, submodule2_defs) The last one is actually pretty tempting to me. One drawback is that it would make it harder to create something like this: class DbtJobBlueprint(Blueprint):
name: str
select: str
def build_defs(self) -> Definitions:
job = define_asset_job(name=self.name, selection=DbtManifestAssetSelection(self.select))
return Definitions(jobs=[job]) |
For me as well. I think it might provided some good "guardrails" that nudges people to not have massive dictionaries of resources, but instead bind resources as close to definitions that consume them as possible, which might be good. At a minimum it is an interesting place to start, and it is purely an additive capability. For the Blueprint use case, I recommend that we introduce a blueprint-specific class (for now) to unblock progress there. We can easily re-examine later. I also suspect there will unique constraints there especially as we make an effort to make those definitions dynamically reloadable. Separating that from mainline |
@schrockn definitely agree that to unblock the blueprints work, need a separate class. I've already added that here: https://github.com/dagster-io/dagster/pull/21980/files#diff-ccaf56ec57423d804ed68d114ca1c21c033a919326296f40f7ac0f215d4343e2R28. My hope is still that we can eliminate that class entirely before going big with blueprints. A simple bag of definitions feels like such a basic and widely applicable object. |
100%. I just don't want work on Blueprints to be blocked for even one second based on this PR. Agree that eliminating the special blueprint-only should be done before launching Blueprints to more than a few design partners. |
To preserve the original description here for historical purposes, spinning up a new PR for the version where we still do immediately validate definitions: #22035. |
## Summary & Motivation There are diverse situations where it's useful to be able to pass around collections of definitions. Doing this in Dagster right now is quite awkward. Relevant situations: - Writing a factory that returns a few assets, a few checks, and a job that executes them together. - A submodule defines a few assets, a job, and a schedule. A different submodule does the same. They should all be in the same code location. - #21387 (comment) - dagster-io/internal#8216 Unlike the [prior version of this PR](#21746), this PR does _not_ turn `Definitions` into a "dumb" data class. I.e. it still validates that all assets and jobs can be bound, at construction time. ## How I Tested These Changes
## Summary & Motivation There are diverse situations where it's useful to be able to pass around collections of definitions. Doing this in Dagster right now is quite awkward. Relevant situations: - Writing a factory that returns a few assets, a few checks, and a job that executes them together. - A submodule defines a few assets, a job, and a schedule. A different submodule does the same. They should all be in the same code location. - dagster-io#21387 (comment) - dagster-io/internal#8216 Unlike the [prior version of this PR](dagster-io#21746), this PR does _not_ turn `Definitions` into a "dumb" data class. I.e. it still validates that all assets and jobs can be bound, at construction time. ## How I Tested These Changes
## Summary & Motivation This PR proposes turning Definitions into a "dumb" data class for passing around diverse sets of definitions. This means that users won't encounter any validation errors for unsatisfied resources, duplicate asset keys, etc., until either: - They run a code server to load the definitions - They invoke `Definitions.validate_loadable` Why do this? - Less work will happen at module import time. This gives users more flexibility to determine where they want to incur the cost of this validation. It also helps enable a future world where run/step workers only need to load/validate the subset of definitions that they need to execute. - Less repetition of resources. This was sited as an issue by the developers of our internal platform: https://dagsterlabs.slack.com/archives/C03AF3MMEEN/p1720546166484669?thread_ts=1720545875.842949&cid=C03AF3MMEEN. - Makes it possible to define asset jobs / schedules in separate `Definitions` objects from the assets they target. I encountered this when trying to write an [example that that allows using blueprints to define dbt selection jobs](#22908). The original Definitions.merge PR contains some discussion on this topic: #21746. ## How I Tested These Changes
Summary & Motivation
What
This PR proposes turning
Definitions
into a "dumb" data class that primarily holds collections of definitions. And it adds a staticmerge
method for merging multipleDefinitions
objects into a singleDefinitions
object.A big change here is that we no longer validate that a
Definitions
object is loadable as a code location at construction time. Instead, avalidate_loadable
method is offered. This enables:Definitions
object and asset definitions in anotherDefinitions
object and later merging them together.Definitions
object and asset definitions in anotherDefinitions
object and later merging them together.A beneficial side effect of this is that less work will happen at module import time. This gives users more flexibility to determine where they want to incur the cost of this validation. It also likely helps move towards a future world where run/step workers only need to load/validate the subset of definitions that they need.
An alternative could be introducing something like an
UnresolvedDefinitions
class. Though I'm having trouble coming up with a name that makes this feel ergonomic.Why
There are diverse situations where it's useful to be able to pass around collections of definitions. Doing this in Dagster right now is quite awkward.
Relevant situations:
How I Tested These Changes